simulated user
Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Shen, Shannon Zejiang, Chen, Valerie, Gu, Ken, Ross, Alexis, Ma, Zixian, Ross, Jillian, Gu, Alex, Si, Chenglei, Chi, Wayne, Peng, Andi, Shen, Jocelyn J, Talwalkar, Ameet, Wu, Tongshuang, Sontag, David
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Michigan (0.04)
- Banking & Finance (1.00)
- Education (0.93)
Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems
Jin, Song, Zhang, Juntian, Liu, Yuhan, Zhang, Xun, Zhang, Yufei, Yin, Guojun, Jiang, Fei, Lin, Wei, Yan, Rui
Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time, and newly introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research. Our code is available at https://github.com/jinsong8/RecInter.
- North America > United States > California (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
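The interaction mechanism described in the RecInter abstract (simulated user actions updating item attributes, with merchant replies) can be pictured with a minimal sketch. This is not the RecInter implementation: the `Item` fields, the action names, and the `apply_action` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical item record whose attributes change as agents act on it."""
    name: str
    likes: int = 0
    purchases: int = 0
    reviews: List[str] = field(default_factory=list)
    merchant_replies: List[str] = field(default_factory=list)


def apply_action(item: Item, action: str, text: str = "") -> None:
    """Update the item in place so later agents see the changed state."""
    if action == "like":
        item.likes += 1
    elif action == "purchase":
        item.purchases += 1
    elif action == "review":
        item.reviews.append(text)
    elif action == "merchant_reply":
        item.merchant_replies.append(text)
    else:
        raise ValueError(f"unknown action: {action}")


# Usage: two simulated users and one merchant act on the same item.
shoe = Item("trail shoe")
apply_action(shoe, "like")
apply_action(shoe, "review", "Runs small.")
apply_action(shoe, "merchant_reply", "Thanks, we suggest sizing up.")
```

The point of the sketch is only that the environment is mutable: an agent arriving later observes the accumulated likes, reviews, and replies rather than a static catalogue entry.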
Magentic-UI: Towards Human-in-the-loop Agentic Systems
Mozannar, Hussein, Bansal, Gagan, Tan, Cheng, Fourney, Adam, Dibia, Victor, Chen, Jingya, Gerrits, Jack, Payne, Tyler, Maldaner, Matheus Kunzler, Grunde-McLaughlin, Madeleine, Zhu, Eric, Bassman, Griffin, Alber, Jacob, Chang, Peter, Loynd, Ricky, Niedtner, Friederike, Kamar, Ece, Murad, Maya, Hosn, Rafah, Amershi, Saleema
AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in most domains including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world also introduce safety and security risks, including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement: co-planning, co-tasking, multi-tasking, action guards, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI's potential to advance safe and efficient human-agent collaboration.
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Asia > Singapore (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.34)
Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items
In Conversational Recommendation Systems (CRS), a user provides feedback on recommended items at each turn, leading the CRS towards improved recommendations. Due to the need for a large amount of data, a user simulator is employed for both training and evaluation. Such user simulators critique the current retrieved item based on knowledge of a single target item. However, system evaluation in offline settings with simulators is limited by the focus on a single target item and by the simulators' unlimited patience over a large number of turns. To overcome these limitations of existing simulators, we propose Fashion-AlterEval, a new dataset that contains human judgments for a selection of alternative items by adding new annotations in common fashion CRS datasets. Consequently, we propose two novel meta-user simulators that use the collected judgments and allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and level of patience. In our experiments using Shoes and Fashion IQ as the original datasets and three CRS models, we find that giving the simulator knowledge of alternatives can have a considerable impact on the evaluation of existing CRS models. Specifically, the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to consider alternative relevant items instead, the system can satisfy the user more quickly.
- Europe > Czechia > Prague (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
Li, Hanyu, Liu, Haoyu, Zhu, Tingyu, Guo, Tianyu, Zheng, Zeyu, Deng, Xiaotie, Jordan, Michael I.
Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights into the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > Jordan (0.14)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Workflow (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Education (0.93)
- Transportation (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
An Agent-Based Modeling Approach to Free-Text Keyboard Dynamics for Continuous Authentication
Continuous authentication systems leveraging free-text keyboard dynamics offer a promising additional layer of security in a multifactor authentication setup that can be used in a transparent way with no impact on user experience. This study investigates the efficacy of behavioral biometrics by employing an Agent-Based Model (ABM) to simulate diverse typing profiles across mechanical and membrane keyboards. Specifically, we generated synthetic keystroke data from five unique agents, capturing features related to dwell time, flight time, and error rates within sliding 5-second windows updated every second. Two machine learning approaches, One-Class Support Vector Machine (OC-SVM) and Random Forest (RF), were evaluated for user verification. Results revealed a stark contrast in performance: while One-Class SVM failed to differentiate individual users within each group, Random Forest achieved robust intra-keyboard user recognition (Accuracy > 0.7) but struggled to generalize across keyboards for the same user, highlighting the significant impact of keyboard hardware on typing behavior. These findings suggest that: (1) keyboard-specific user profiles may be necessary for reliable authentication, and (2) ensemble methods like RF outperform One-Class SVM in capturing fine-grained user-specific patterns. Keywords: keyboard dynamics, continuous authentication, agent-based modeling, One-Class SVM, Random Forest, behavioral biometrics.
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Asia > Singapore (0.04)
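The windowed features named in the keyboard-dynamics abstract (dwell time, flight time, and error rate over 5-second windows advanced every second) can be sketched as follows. This is an illustrative assumption, not the study's code: the `KeyEvent` type and both function names are hypothetical, dwell is taken as release minus press, flight as the gap between one key's release and the next key's press, and the error rate is approximated as the share of BACKSPACE presses in the window.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class KeyEvent:
    key: str
    press: float    # press timestamp in seconds
    release: float  # release timestamp in seconds


def window_features(events: List[KeyEvent], start: float,
                    width: float = 5.0) -> Optional[Dict[str, float]]:
    """Features for keys pressed in [start, start + width); None if empty."""
    win = [e for e in events if start <= e.press < start + width]
    if not win:
        return None
    dwell = [e.release - e.press for e in win]
    flight = [b.press - a.release for a, b in zip(win, win[1:])]
    # Assumption: BACKSPACE presses stand in for typing errors.
    errors = sum(1 for e in win if e.key == "BACKSPACE")
    return {
        "start": start,
        "mean_dwell": mean(dwell),
        "mean_flight": mean(flight) if flight else 0.0,
        "error_rate": errors / len(win),
    }


def sliding_features(events: List[KeyEvent], width: float = 5.0,
                     step: float = 1.0) -> List[Dict[str, float]]:
    """Slide a `width`-second window forward by `step` seconds at a time."""
    events = sorted(events, key=lambda e: e.press)
    if not events:
        return []
    rows = []
    start, last = events[0].press, events[-1].press
    while start <= last:
        row = window_features(events, start, width)
        if row is not None:
            rows.append(row)
        start += step
    return rows
```

Each returned row is one feature vector; a classifier such as a Random Forest or One-Class SVM would be trained on these rows per user and per keyboard.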
Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User
Wang, Xiaolei, Xia, Chunxuan, Li, Junyi, Meng, Fanzhe, Huang, Lei, Wang, Jinpeng, Zhao, Wayne Xin, Wen, Ji-Rong
Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations. A fundamental challenge in CRSs lies in effectively understanding user preferences from conversations. User preferences can be multifaceted and complex, posing significant challenges for accurate recommendations even with access to abundant external knowledge. While interaction with users can clarify their true preferences, frequent user involvement can lead to a degraded user experience. To address this problem, we propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs. The simulated user provides feedback to the items recommended by CRSs, enabling them to better capture intricate user preferences through multi-turn interaction. Inspired by generative reward models, we design two types of feedback actions for the simulated user: generative item scoring, which offers coarse-grained feedback, and attribute-based item critique, which provides fine-grained feedback. To ensure seamless integration, these feedback actions are unified into an instruction-based format, allowing the development of a unified simulated user via instruction tuning on synthesized data. With this simulated user, automatic multi-turn interaction with CRSs can be effectively conducted. Furthermore, to strike a balance between effectiveness and efficiency, we draw inspiration from the paradigm of reward-guided search in complex reasoning tasks and employ beam search for the interaction process. On top of this, we propose an efficient candidate ranking method to improve the recommendation results derived from interaction. Extensive experiments on public datasets demonstrate the effectiveness, efficiency, and transferability of our approach.
- Leisure & Entertainment (0.68)
- Media > Film (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
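The beam search over the interaction process mentioned in the GRSU abstract can be illustrated with a generic sketch. It is not the authors' method: `beam_search_interaction` and its parameters are hypothetical, and the `score_item` callable is a stand-in for the simulated user's item scoring (the attribute-based critique action is not modeled here).

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]  # items recommended so far, one per turn


def beam_search_interaction(
    candidates: Sequence[str],
    score_item: Callable[[Path, str], float],
    beam_width: int = 2,
    num_turns: int = 3,
) -> List[Tuple[float, Path]]:
    """Keep the `beam_width` highest-scoring interaction paths per turn."""
    beams: List[Tuple[float, Path]] = [(0.0, ())]
    for _ in range(num_turns):
        expanded = []
        for total, path in beams:
            for item in candidates:
                if item in path:
                    continue  # do not recommend the same item twice
                reward = score_item(path, item)  # simulated-user feedback
                expanded.append((total + reward, path + (item,)))
        if not expanded:
            break  # every candidate has already been recommended
        # Stable sort: ties keep their expansion order.
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]
    return beams


# Usage with a toy scorer that rewards items by a fixed hidden preference.
hidden_pref = {"A": 3.0, "B": 2.0, "C": 1.0}
best = beam_search_interaction(
    ["A", "B", "C"], lambda path, item: hidden_pref[item],
    beam_width=2, num_turns=2,
)
```

Widening the beam explores more interaction paths at a higher scoring cost, which is the effectiveness versus efficiency trade-off the abstract refers to.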
Program Synthesis Dialog Agents for Interactive Decision-Making
Toles, Matthew, Balwani, Nikhil, Singh, Rattandeep, Rodriguez, Valentina Giulia Sartori, Yu, Zhou
Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.
- North America > United States > New York (0.04)
- Asia > Middle East > Jordan (0.04)
- Law (0.91)
- Health & Medicine > Therapeutic Area > Immunology (0.67)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.85)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.82)
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
Li, Ryan, Zhang, Yanzhe, Yang, Diyi
Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Florida > Bay County > Lynn Haven (0.04)
- Information Technology (0.67)
- Law (0.46)
Towards Goal-Oriented Agents for Evolving Problems Observed via Conversation
Free, Michael, Langworthy, Andrew, Dimitropoulaki, Mary, Thompson, Simon
The objective of this work is to train a chatbot capable of solving evolving problems through conversing with a user about a problem the chatbot cannot directly observe. The system consists of a virtual problem (in this case a simple game), a simulated user capable of answering natural language questions that can observe and perform actions on the problem, and a Deep Q-Network (DQN)-based chatbot architecture. The chatbot is trained with the goal of solving the problem through dialogue with the simulated user using reinforcement learning. The contributions of this paper are as follows: a proposed architecture to apply a conversational DQN-based agent to evolving problems, an exploration of the effect of training methods such as curriculum learning on model performance, and an exploration of the effect of modified reward functions as environment complexity increases.
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New York > New York County > New York City (0.04)